Abstract

We use nucleosome maps obtained by high-throughput sequencing to study se- 
quence specificity of intrinsic histone-DNA interactions. In contrast with previous 
approaches, we employ an analogy between a classical one-dimensional fluid of finite- 
size particles in an arbitrary external potential and arrays of DNA-bound histone 
octamers. We derive an analytical solution to infer free energies of nucleosome for- 
mation directly from nucleosome occupancies measured in high-throughput experi- 
ments. The sequence-specific part of free energies is then captured by fitting them 
to a sum of energies assigned to individual nucleotide motifs. We have developed hi- 
erarchical models of increasing complexity and spatial resolution, establishing that 
nucleosome occupancies can be explained by systematic differences in mono- and 
dinucleotide content between nucleosomal and linker DNA sequences, with peri- 
odic dinucleotide distributions and longer sequence motifs playing a secondary role. 
Furthermore, similar sequence signatures are exhibited by control experiments in 
which genomic DNA is either sonicated or digested with micrococcal nuclease in 
the absence of nucleosomes, making it possible that current predictions based on 
high-throughput nucleosome positioning maps are biased by experimental artifacts. 

1 Introduction 

In eukaryotes, 75-90% of genomic DNA is packaged into histone-DNA complexes called 
nucleosomes. Each nucleosome consists of 147 base pairs (bp) of DNA wrapped around
a histone octamer in a left-handed superhelix. Arrays of nucleosomes fold into filamentous
chromatin fibers which constitute building blocks for higher-order structures. DNA
wrapped in a nucleosome is occluded from interacting with other DNA-binding proteins
such as transcription factors, RNA polymerase, and DNA repair complexes. On the other
hand, histone tail domains act as substrates for post-translational modifications, provid- 
ing binding sites for chromatin-associated proteins which facilitate transitions between 
active and silent chromatin states.

Several distinct factors affect nucleosome positions in living cells. First of all, intrinsic 
histone-DNA interactions are sequence-specific: for example, poly(dA:dT) tracts are well
known to disfavor nucleosome formation. In addition, nucleosome-depleted regions can
be generated through the action of ATP-dependent chromatin remodeling enzymes and
histone acetylases. Finally, non-histone DNA-binding factors can alter nucleosome positions
by binding their cognate sites and either displacing nucleosomes or hindering
their subsequent formation.

The nucleosome code hypothesis states that DNA sequence is the primary determinant 
of nucleosome positions in living cells. This hypothesis is often contrasted with the idea 
of statistical positioning which asserts that most nucleosomes are ordered into regular 
arrays simply by steric exclusion. In this view the nucleosomal arrays are "phased"
by external boundaries such as DNA-bound factors or DNA sequences unfavorable for nucleosome
formation. It is also possible that a small number of nucleosomes with favorable
binding affinities create boundaries against which neighboring nucleosomes are ordered
by steric exclusion.

Nucleosome positioning can be thought of as rotational, referring to the 10-11 bp- 
periodic orientation of the DNA helix with respect to the surface of the histone octamer, 
and translational, referring to the 147 bp-long sequence covered by a particular histone 
octamer. Optimal rotational positioning minimizes free energy of sequence-specific DNA 
bending, causing 10-11 bp periodicity of dinucleotide frequencies in nucleosome positioning 
sequences. We use a probabilistic description of translational positioning in which 147
bp sites with favorable free energies of nucleosome formation have a higher probability to 
be nucleosome-covered. 

To study the contribution of intrinsic histone-DNA interactions to nucleosome po- 
sitioning, several computational models based solely on the DNA sequence have been 
developed. These models can be divided into bioinformatics models, which are trained on sets
of nucleosomal sequences obtained from living cells or from in vitro reconstitution
experiments, and ab initio models, which predict nucleosome energies and occupancies using
DNA elasticity theory and structural data.

Here we develop a physical model for predicting free energies of nucleosome formation 
directly from high-throughput maps of nucleosome positions. Our model employs an 
exact relation between measured nucleosome occupancies and free energies, treating steric 
exclusion rigorously in the presence of histone-DNA interactions of arbitrary strength and 
sequence specificity. We focus in particular on nucleosomes reconstituted in vitro on yeast 
genomic DNA. In this case nucleosome locations are affected solely by intrinsic histone-DNA
interactions and by formation of higher-order chromatin structures. We compare
our predictions with sequence signals from two nucleosome-free control experiments in 
which DNA was either sonicated or digested with micrococcal nuclease (MNase) to yield 
mononucleosome-size segments. We also test the ability of our in vitro model to predict
nucleosome positions in vivo and study the universality of nucleosome positioning motifs
by applying our approach to other organisms. 

2 Results 

Biophysical model of nucleosome occupancy and energetics. 

We have predicted histone-DNA energies genome-wide using an analogy between arrays
of nucleosomes and a one-dimensional fluid of non-overlapping particles of size 147 bp
in an arbitrary external potential. The nucleosomal array is a discrete version of a
one-dimensional system of finite-size particles for which Jerry K. Percus showed that particle
energies can be inferred exactly from the density profile. Although our method neglects
formation of three-dimensional chromatin structures which may cause linker DNA
to adopt preferred lengths, it rigorously takes into account both steric exclusion between
neighboring particles and intrinsic histone-DNA interactions, including the 10-11
bp periodic rotational component. Our approach, outlined in Fig. 1, proceeds in the
direction opposite to previous work which first employed either bioinformatics or DNA
elasticity theory to construct a sequence-specific histone-DNA interaction potential and then
positioned nucleosomes on genomic DNA without steric overlap. In contrast, we
employ an exact decomposition from experimentally available nucleosome probabilities 
and occupancies to free energies of nucleosome formation which we call Percus energies 
(Eq. 1). 

To extract the sequence-specific component of nucleosome energetics, we fit Percus
energies at each genomic bp to a sum of energies of individual nucleotide motifs ranging
from 1 to N bp in length (see Methods). There is no need to construct an explicit
background model of word frequencies with this approach. Words with the same
nucleotide sequence have the same energy if they occur anywhere in the 147 bp-long
nucleosomal site (the position-independent model, Eq. 2), or fall into one of the three
equal-length regions that span the 147 bp site (the three-region model), or are separated
by an integer multiple of the 10 bp DNA helical twist (the periodic model). All models are
constrained to assign non-zero energies to words with N nucleotides only if the sequence
specificity of Percus energies cannot be captured using words with 1 ... N − 1 nucleotides
(see Supplementary Methods). We refer to the maximum length of the words included
in a model as its order N. In addition, we have developed an order 2 model in which
mono- and dinucleotides are allowed to have different energies at every position in the 147
bp-long nucleosomal site (the spatially resolved model, Eq. 3).

The sequence-specific models provide nucleosome formation energies at each bp in 
S.cerevisiae, C.elegans and E.coli genomes. These energies serve as input to a standard 
recursive algorithm which computes probabilities to start a nucleosome at each genomic 
bp and thus nucleosome occupancies (defined as the probability that a given bp is covered 
by any nucleosome, see Methods).
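
The recursive step described above can be illustrated with a forward-backward partition function calculation for non-overlapping particles on a lattice. The following is a minimal Python sketch under stated assumptions, not the software used in this work: the function name `occupancy_from_energies` is hypothetical, energies are taken to be (E_i − μ)/k_BT for a particle starting at 0-based position i, and no log-space rescaling is applied, so it is only suitable for short sequences.

```python
import math


def occupancy_from_energies(energies, size=147):
    """Start probabilities P and occupancies O for hard rods of length `size`.

    energies[s] is (E_s - mu)/kBT for a particle starting at bp s (0-based),
    defined for s = 0 .. L - size, where L is the sequence length.
    """
    n_starts = len(energies)
    L = n_starts + size - 1

    # forward[i]: statistical weight of all configurations on bps 0 .. i-1
    forward = [1.0] * (L + 1)
    for i in range(1, L + 1):
        forward[i] = forward[i - 1]          # bp i-1 is left empty
        s = i - size                         # particle covering bps s .. i-1
        if s >= 0:
            forward[i] += forward[s] * math.exp(-energies[s])

    # backward[i]: statistical weight of all configurations on bps i .. L-1
    backward = [1.0] * (L + 1)
    for i in range(L - 1, -1, -1):
        backward[i] = backward[i + 1]        # bp i is left empty
        if i + size <= L:                    # particle starting at bp i
            backward[i] += math.exp(-energies[i]) * backward[i + size]

    Z = forward[L]
    P = [forward[s] * math.exp(-energies[s]) * backward[s + size] / Z
         for s in range(n_starts)]

    # O[i] is the sum of P over all starts whose particle covers bp i
    O, window = [], 0.0
    for i in range(L):
        if i < n_starts:
            window += P[i]
        if i - size >= 0:
            window -= P[i - size]
        O.append(window)
    return P, O
```

For example, two overlapping candidate sites of length 2 on a 3 bp lattice with equal energies give three equally weighted states (empty, left site, right site), so each start probability is 1/3 and the middle bp has occupancy 2/3.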

A:T/G:C content is the primary determinant of nucleosome sequence prefer- 
ences in S.cerevisiae. 
The N = 5 position-independent model, which assigns energies to 364 independent
words from 1 bp to 5 bp in length, is capable of accurately predicting occupancy by 
nucleosomes assembled in vitro on yeast genomic DNA (Fig. 2a,b). Remarkably, even 
though the model is based on Zhang et al. high-throughput nucleosome positioning data 
(yielding r=0.54 on average between Percus energies and sequence-specific energies fit 
independently on 10 DNA segments of equal length spanning the yeast genome, and 
r=0.61 between predicted and observed occupancies), its prediction of Kaplan et al. in
vitro occupancies is more accurate (r=0.75), partially due to the 2.85-fold higher sequence
coverage in the latter dataset. Indeed, the correlation coefficient drops from 0.75 to 0.70 
when sequence reads are randomly removed from the Kaplan et al. map to match Zhang 
et al. level of read coverage. 

The correlation between the two in vitro datasets is rather low (r=0.69), probably 
because Kaplan et al. assembled in vitro chromatin at less than physiological histone 
octamer concentrations. The N = 5 model is also highly successful in discriminating
between high- and low-occupancy regions (dashed curves in Fig. 2c). Its performance is
comparable to the Kaplan et al. bioinformatics model which takes both distributions of 5
bp-long words in nucleosomes and linkers and position-dependent dinucleotide frequencies 
into account (Supplementary Table 1, dotted curves in Fig. 2c). Occupancies predicted 
by the two models are highly correlated (r=0.89) and thus capture essentially the same 
nucleosome sequence preferences. Note that we report correlations between occupancy 
profiles while Kaplan et al. log-transform occupancies before computing a linear correla- 
tion coefficient: as a result we obtain r=0.79 between Kaplan et al. predicted and in vitro 
occupancies, whereas they report r=0.89 for the same comparison.

However, we find that using 5 bp-long words is not necessary: N = 2 ... 4 position-independent
models are virtually identical to the N = 5 model in ranking 5 bp-long
sequences (Fig. 2d), classifying high- and low-occupancy regions (solid curves in Fig. 2c),
and predicting in vitro nucleosome occupancies (Supplementary Fig. 1a,b). The N = 2
model remains highly correlated with the Kaplan et al. bioinformatics model (r=0.89;
Supplementary Fig. 2). Remarkably, even the N = 1 position-independent model with
one free parameter (ε_A = ε_T and ε_C = ε_G if both DNA strands are included for each
mapped nucleosome) retains most of the predictive power of the higher-order models
(Fig. 2d, Supplementary Table 1), in agreement with a recent independent study. Thus
positions of nucleosomes reconstituted in vitro on the yeast genome are largely controlled 
by the differences in frequencies of A:T and G:C dinucleotides in nucleosomes and linkers. 
In particular, higher-order terms play little role in the energetics of poly(dA:dT) tracts 
(Supplementary Fig. 3). 

Indeed, Fig. 3a shows that DNA sequences of well-positioned nucleosomes (defined 
by 5 or more sequence reads mapped to the same genomic coordinates) are character- 
ized by sharp A:T/G:C discontinuities across the nucleosome boundary. Overall, A:T 
nucleotides are depleted in nucleosomes and enriched in linkers, with the opposite true for 
G:C nucleotides. Although well-positioned nucleosomes make up only 5.4 % of all mapped 
nucleosomes defined by one or more sequence reads, they produce an occupancy profile 
which is highly correlated with the total nucleosome occupancy (r=0.71, with 56.4 % of 
genomic bps covered by at least one well-positioned nucleosome). In contrast, 81.5 % of
all nucleosomes are defined by just one or two reads and exhibit little sequence specificity 
(dashed lines in Fig. 3a). Furthermore, the N = 2 position-independent model based 
only on well-positioned nucleosomes is virtually identical to the N = 2 model based on 
all nucleosomes (rank correlation of 0.94 between the two sets of dinucleotide energies). 
Thus our predictions reveal sequence preferences of a subset of nucleosomes that tend to 
occupy unique sites on the DNA. 

Periodic motif distributions do not play a significant role in nucleosome occu- 
pancy predictions. 

Besides the A:T/G:C discontinuities, Fig. 3a reveals two additional features that could
affect positioning preferences of yeast nucleosomes: prominent 10-11 bp dinucleotide pe- 
riodicity and a particularly strong A:T depletion and G:C enrichment within 20 bp of the 
nucleosome dyad. To test the utility of these features in nucleosome occupancy predic- 
tions we have employed three additional models that either partially or fully differentiate 
between words located at different positions within the nucleosomal site. 

The three-region model assigns different energies to words found in the 47 bp-long
core and flanking regions and is thus capable of capturing prominent dinucleotide biases
in the vicinity of the nucleosomal dyad. The periodic model enforces 10 bp helical twist
periodicity while disregarding global effects. The most detailed, spatially resolved
model mirrors all three main features exhibited by the frequencies of dinucleotides found
in nucleosome positioning sequences (Fig. 3b). Nevertheless, these models do not offer
a significant improvement over the N = 2 position-independent model (Table 1, Supple- 
mentary Table 1), reflecting the fact that all three features are simultaneously present 
in well-positioned in vitro nucleosomes (Fig. 3a) and so knowing any one of them is 
sufficient. Furthermore, global A:T/G:C discontinuities appear to play the role of the 
primary nucleosome positioning determinant, whereas periodic dinucleotide distributions 
and local enrichments can be greatly diminished or absent in other organisms and in in 
vivo nucleosome positioning maps from yeast (Supplementary Fig. 4). 

However, the rotational positioning component of the yeast model should be more 
predictive for nucleosomes positioned on DNA sequences with prominent 10-11 bp dinu- 
cleotide periodicities. Indeed, the spatially resolved model works better than the N = 2 
position-independent model for six non-yeast nucleosomes whose in vitro positions on 
short (< 250 bp) DNA sequences have been determined with single bp precision by hy- 
droxyl radical footprinting (Supplementary Fig. 5). 

In vivo nucleosome positions are partially controlled by the underlying DNA 
sequence. 

We investigated whether simple rules that govern in vitro nucleosome positions remain 
valid in living cells where chromatin structure may be affected by remodeling enzymes and 
by competition with non-histone DNA-binding factors. Indeed, in vivo nucleosomes ap- 
pear to be well-positioned in the vicinity of transcription start and termination sites, with 
prominent nucleosome-depleted regions (NDRs) on both ends of the transcript (Supplementary
Fig. 6; in vivo chromatin comes from cells grown in YPD medium). In contrast,
in vitro nucleosomes are much more delocalized (Supplementary Fig. 7), so that nucleosomal
arrays around NDRs are not ordered and NDRs themselves are much less pronounced
(Supplementary Fig. 6).

Despite these differences, dinucleotide energies extracted from in vitro and in vivo nu- 
cleosome positioning maps are reasonably well correlated (Supplementary Fig. 8), yielding 
nearly identical predictions of Zhang et al. in vitro nucleosome occupancies (Supplemen- 
tary Table 1). Although dinucleotide energies inferred from the in vivo map of cross-linked 
nucleosomes are not as close to their in vitro counterparts as the energies based on the 
in vivo map without cross-linking, the two in vivo models yield very similar occupancy 
profiles (r=0.94). This indicates that MNase binding does not significantly alter nucleo- 
some positions. Clearly, the striking oscillations observed in the in vivo occupancy profile 
(Supplementary Fig. 6) are not due to intrinsic sequence preferences but rather involve 
biological factors such as components of transcription initiation machinery that may act to 
position the first nucleosome downstream of the NDR (the so-called +1 nucleosome).

Energetics of nucleosome formation in E.coli and C.elegans.

To study whether dinucleotide-based nucleosome positioning patterns observed in 
S. cerevisiae extend to other organisms, we have inferred position-independent dinucleotide 
energies from a map of nucleosomes assembled in vitro on the E.coli genome. Although 
the correlation between observed and predicted occupancies was modest in this case (Fig. 
4b), probably because the E.coli genome did not evolve to favor nucleosome formation 
(resulting in lower sequence read coverage in competition with yeast DNA), the dinu- 
cleotide energies were similar in yeast and E.coli (Fig. 4c). The most prominent difference 
was exhibited by the four CG-containing dinucleotides which have the lowest energies in 
S. cerevisiae but occupy middle positions in the case of E.coli (Supplementary Table 2). 

Dinucleotide energies inferred from the in vivo map of C.elegans nucleosomes, while
an excellent predictor of nucleosome occupancies in the C.elegans genome (Fig. 4a), are
even further from their yeast counterparts (Fig. 4c): ε_CG and ε_GC become comparable to
ε_AA/TT (which is the highest in all three organisms), whereas ε_CC/GG is close to ε_TA, which
is the third most unfavorable in yeast (Supplementary Table 2). It is possible that in vivo
effects override intrinsic nucleosome preferences in C.elegans. In addition, we find that
the mononucleotide model is much less predictive in this organism: N = 2 and N = 1
position-independent models yield r=0.65 and r=0.45 correlations with the in vivo map 
to which they were fit, vs. r=0.60 and r=0.54 for the same models applied to S. cerevisiae 
in vitro nucleosomes (Supplementary Table 1). On the other hand, fitting energies of 3 
bp-long words resulted only in a slight (3.0 %) improvement in the correlation coefficient, 
indicating that it is not necessary to keep track of higher-order motifs in C. elegans. 

Although the dinucleotide energies are somewhat different in the three organisms we 
examined, position-independent models from one organism can still be used to predict 
nucleosome positions in another. For example, using the N = 2 E.coli model to predict 
in vitro nucleosome occupancies in S. cerevisiae results in r=0.55, which is only a little
worse than r=0.60 observed with the "native" model (Supplementary Table 1). The N = 2 
C.elegans model has a correlation of 0.46 with the in vitro occupancy from S. cerevisiae, 
while the correlation between the N = 2 S. cerevisiae in vitro model and the in vivo 
occupancy from C.elegans is 0.52, somewhat lower than 0.65 obtained with the "native" 
model. Therefore it is possible to make useful predictions in organisms for which high-throughput
nucleosome positioning maps are not yet available.

Nucleosome-free control experiments can be used to predict nucleosome po- 
sitions. 

Depletion of A:T and enrichment of G:C-containing dinucleotides in nucleosomal se- 
quences and the discontinuity of dinucleotide frequencies across the nucleosome boundary 
may be an experimental artifact caused by MNase sequence specificity rather than a
reflection of intrinsic histone-DNA interactions. To study this possibility we have exam- 
ined two collections of sequence reads obtained from nucleosome-free control experiments. 
In one experiment a mixture of genomic DNA from S.cerevisiae and E.coli was digested 
with MNase, gel-purified to isolate ~ 150 bp DNA segments, and sequenced. In the other 
experiment DNA was sonicated rather than MNase-digested (see Methods). Because 
DNA segments are approximately constant in length we can compute Percus energies and 
analyze their sequence specificity using N = 2 position-independent models (Fig. 1). 

Surprisingly, both control experiments yield dinucleotide energies that are very close 
to those obtained from the in vitro nucleosome positioning map (Fig. 5). The differ- 
ences are smaller than those between the N = 2 in vitro model and the in vivo model 
based on cross-linked nucleosomes (Supplementary Fig. 8), and are comparable to the 
discrepancies between two N = 2 in vitro models inferred from Kaplan et al. and Zhang 
et al. datasets (which yield a ρ = 0.96 rank-order correlation coefficient between the
two sets of dinucleotide energies). As a result, N = 2 models trained on sonication and 
MNase controls predict Kaplan et al. in vitro nucleosome occupancies with correlation 
coefficients of 0.64 and 0.58, respectively, compared with r=0.75 for the nucleosome model 
(Supplementary Table 1). 

The most obvious explanation for this finding is that dinucleotide energies in nucleosomes
reflect experimental artifacts. Indeed, the distribution of dinucleotide frequencies
in MNase-digested DNA segments of mononucleosome size is rather similar to that ob- 
served in nucleosomes, especially in the vicinity of the segment boundary (Supplementary 
Fig. 4e). Alternatively, nucleosome-free controls may be enriched in sequences that re- 
semble nucleosome positioning sequences. For example, sonication tends to break DNA 
segments across the A:T/G:C "fault lines", although the resulting depletion of A:T and 
enrichment of G:C-containing dinucleotides are rather small (Supplementary Fig. 4f). 
The degree to which currently available nucleosome maps are affected by experimental 
artifacts requires further studies which do not rely on MNase digestion or sonication to 
isolate mononucleosome cores. 

3 Discussion 

Nucleosome positioning has been extensively studied using MNase digestion to isolate
mononucleosomal DNA, followed by either microarray hybridization or high-throughput
sequencing. Several bioinformatics models have been fit to these data in
order to determine intrinsic histone-DNA sequence specificity and its contribution to in
vivo chromatin structure. Until recently, these models tended to be quite complex, taking
both periodic dinucleotide distributions and relative frequencies of longer motifs into
account, although it has also been stated that simple models based on A:T/G:C
content and related descriptors are sufficient for predicting nucleosome occupancies.
However, these findings may be subject to MNase and sequencing biases which inevitably 
affect currently available nucleosome maps. Besides, bioinformatics models are not capa- 
ble of directly predicting nucleosome formation energies. 

To study these issues, we have developed a biophysical approach to inferring nucleosome
energies and occupancies from high-throughput sequencing data. The effect
of steric exclusion is rigorously separated from intrinsic histone-DNA interactions under
the assumption that nucleosomes form a one-dimensional array in which there are
no nucleosome-nucleosome interactions besides nearest-neighbor steric hindrance. This
assumption amounts to neglecting intrinsic structure of the chromatin fiber which is believed
to impose "quantized" linker lengths. Furthermore, we assume that the
one-dimensional nucleosome array is in thermodynamic equilibrium, with individual nu- 
cleosome positions corresponding to the lowest free energy state of the entire array. In 
vivo nucleosomes may not be in equilibrium due to the action of chromatin remodeling 
enzymes and other energy-dependent processes. 

We find that most mapped nucleosomes are not sequence-specific. However, well- 
positioned nucleosomes defined by five or more sequence reads tend to occupy G:C- 
enriched and A:T-depleted DNA segments in S.cerevisiae (Fig. 3a). These nucleosomes 
alone define an occupancy profile which is highly correlated (r=0.71) with the profile 
based on all mapped nucleosomes. Thus A:T and G:C-containing dinucleotide content is 
different in nucleosomal and linker sequences and is highly predictive of nucleosome po- 
sitions. More complex models that take rotational positioning into account do not yield 
significantly improved predictions (Table 1), indicating that 10 bp dinucleotide periodicities
alone do not define nucleosome positions on the yeast genome. It is possible that
nucleosomal sequences first evolved to be G:C-rich and then acquired 10 bp dinucleotide 
periodicity to take advantage of the rotational positioning mechanism. 

Surprisingly, models trained on DNA segments from nucleosome-free control exper- 
iments can be used to predict nucleosome occupancies (Fig. 5, Supplementary Table 
1). It is possible that experimental biases obscure nucleosome positioning signals in cur- 
rent high-throughput experiments. Alternatively, DNA from control experiments may be 
enriched in nucleosome positioning sequences because A:T/G:C discontinuities make it 
easier for sonication or MNase to break DNA at the nucleosome boundary even in the 
absence of nucleosomes. MNase- and sonication-free nucleosome positioning maps are 
required to resolve this issue. 

In summary, nucleosome sequence preferences can be captured using a simple physical 
model based on dinucleotide content. Promoter regions are unfavorable for nucleosome 
formation, while +1 nucleosomes have lower energies and help define nucleosome array
boundaries, although sequence preferences alone are not strong enough to explain perfectly 
phased in vivo arrays (Supplementary Figure 6). Similar nucleosome-positioning rules 
can be extracted from in vitro and in vivo chromatin (Supplementary Figure 8), strongly 
suggesting that nucleosomes tend to occupy thermodynamically favorable positions in
living cells.

4 Materials and Methods 

Parallel sequencing and mapping of in vitro and in vivo nucleosomes. 

Nucleosomes were reconstituted in vitro on S.cerevisiae and E.coli genomic DNA as 
follows: genomic DNA from S.cerevisiae and E.coli was purified using Qiagen genomic
tip 500/G. Yeast and bacterial DNA were mixed at a 3:1 mass ratio. 5-10 kb DNA 
segments (obtained by light sonication of purified genomic DNA) were assembled into 
chromatin by salt dialysis and extensively digested with MNase to yield mononucleosome 
core particles. Mononucleosomal DNA was purified by excision from an agarose gel and 
sequenced using an Illumina Genome Analyzer. This procedure resulted in a collection of
3239990 25 bp-long sequence reads (0.27 reads per bp) mapped to the S.cerevisiae genome
(SGD April 2008 build) and 336338 reads (0.07 reads per bp) mapped to the E.coli K12 
genome (U00096), allowing up to 2 mismatches per read. 

In control experiments, a mixture of genomic DNA from S.cerevisiae and E.coli was
prepared as described above. Part of this mixture was treated with MNase (USB) to 
yield a small average fragment size (< 300 bp), and DNA fragments of approximately 150 
bp were purified by excision from an agarose gel. A second fraction of the yeast/bacterial 
DNA mixture was subjected to sonication in a Misonix water-bath instrument to yield 
an average fragment size of 150 bp. Mononucleosome-size DNA fragments were sequenced
using Illumina Genome Analyzer, yielding 1160528 reads mapped to yeast (0.10 reads per 
bp) for the MNase-digested fraction and 1326882 reads mapped to yeast (0.11 reads per 
bp) for the sonicated fraction. 

We have also used maps of in vivo nucleosomes prepared from log-phase yeast cells
grown in YPD medium. In two replicates nucleosomes were cross-linked with formaldehyde
prior to MNase digestion, and in four replicates the cross-linking step was omitted.
We have combined sequence reads in each case, resulting in 0.50 and 1.50 mapped reads
per bp respectively. Kaplan et al. also provide two replicates for nucleosomes reconstituted
in vitro on yeast genomic DNA, with a total of 0.77 mapped reads per bp.
C.elegans nucleosomes came from mixed stage, wild-type (N2) cells. Mononucleosome
cores were liberated with MNase and sequenced on a SOLiD Analyzer (Applied Biosystems).
We have used sequence read coordinates provided by the authors, yielding 0.44
reads per bp.

Pre-processing of nucleosome sequence reads. 

We assume that genomic coordinates of 25 bp-long mapped sequence reads define 
nucleosome positions on yeast and E.coli genomes. We extend all mapped reads to the 
147 bp canonical nucleosome length and combine reads from both strands (Supplementary 
Methods). This procedure yields the number of nucleosomes that start at each genomic 
bp (the sequence read profile; Fig. 1a), as well as the number of nucleosomes that cover
a given bp (the nucleosome coverage profile). We control for sequencing and mapping 
artifacts by removing regions with anomalously high and low nucleosome coverage from
further consideration (Supplementary Methods). 
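
The read extension step can be sketched as follows. This is an illustrative Python example rather than the actual pipeline: the function name `read_profiles`, the `(five_prime, strand)` read representation, and the 0-based coordinates are assumptions, and the exact strand-combination rules are given in Supplementary Methods. Here a forward-strand read is assumed to mark the first bp of the nucleosome and a reverse-strand read its last bp.

```python
def read_profiles(reads, genome_length, size=147):
    """Build the sequence read profile and the nucleosome coverage profile.

    reads: iterable of (five_prime, strand) with 0-based coordinates and
    strand "+" or "-" (hypothetical input format).
    """
    starts = [0] * genome_length
    for five_prime, strand in reads:
        # A reverse-strand read is assumed to mark the last bp of the nucleosome.
        start = five_prime if strand == "+" else five_prime - size + 1
        if 0 <= start <= genome_length - size:
            starts[start] += 1

    # coverage[i] counts nucleosomes whose `size` bp footprint includes bp i
    coverage = [0] * genome_length
    running = 0
    for i in range(genome_length):
        running += starts[i]
        if i >= size:
            running -= starts[i - size]
        coverage[i] = running
    return starts, coverage
```

The running sum keeps the coverage calculation linear in genome length, which matters for profiles spanning whole chromosomes.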

Next we smooth the sequence read and nucleosome coverage profiles by replacing the 
number of nucleosomes starting at each bp with a Gaussian centered on that bp (Fig.
1b). The area of the Gaussian is equal to the number of sequence reads starting at
that position, while its σ is set to either 2 or 20. Gaussian smoothing is necessary because
current levels of sequence read coverage lead to large deviations in the number of nucle- 
osomes located at neighboring bps, contrary to the expectation that such nucleosomes 
should have very similar binding affinities because they occupy nearly identical sites. 
The effect of Gaussian smoothing can be seen in Supplementary Fig. 9. 
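
A minimal Python sketch of this smoothing step is shown below. The function name `gaussian_smooth` and the truncation of the kernel at `cutoff` standard deviations are assumptions made for illustration; the kernel is normalized so that each read count contributes a Gaussian whose area equals that count (away from the sequence ends).

```python
import math


def gaussian_smooth(counts, sigma, cutoff=4.0):
    """Replace each count with a discrete Gaussian of the same total area."""
    radius = int(math.ceil(cutoff * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    norm = sum(kernel)
    kernel = [k / norm for k in kernel]      # unit area on the lattice

    smoothed = [0.0] * len(counts)
    for i, c in enumerate(counts):
        if c == 0:
            continue
        for d, k in zip(offsets, kernel):
            j = i + d
            if 0 <= j < len(counts):         # weight falling off the ends is dropped
                smoothed[j] += c * k
    return smoothed
```

With σ = 2 the smoothed profile retains bp-scale structure, whereas σ = 20 averages over neighboring positions far more strongly.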

Finally, we normalize the sequence read and nucleosome coverage profiles by the high- 
est value of nucleosome coverage on the chromosome. We interpret the resulting nor- 
malized profiles as the probability to start a nucleosome at a given bp (the nucleosome 
probability profile) and the probability that a given bp is covered by any nucleosome (the 
nucleosome occupancy profile). 
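
The normalization can be written in a few lines. The sketch below is illustrative only; `normalize_profiles` is a hypothetical name, and both inputs are assumed to be the smoothed profiles for a single chromosome.

```python
def normalize_profiles(starts, coverage):
    """Divide both profiles by the highest coverage value on the chromosome."""
    peak = max(coverage)
    if peak <= 0:
        raise ValueError("coverage profile is empty or all zero")
    probability = [s / peak for s in starts]     # nucleosome probability profile
    occupancy = [c / peak for c in coverage]     # nucleosome occupancy profile
    return probability, occupancy
```

Because both profiles are divided by the same constant, the occupancy remains the sum of start probabilities over the positions that cover a given bp.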

Prediction of nucleosome energetics from high-throughput sequencing maps. 

We derive nucleosome formation energies directly from the smoothed probability and 
occupancy profiles, under the assumption that observed nucleosome positions are affected 
solely by intrinsic histone-DNA interactions and steric exclusion (Supplementary Meth- 
ods): 

$$\frac{E_i - \mu}{k_B T} = \log\frac{1 - O_i + P_i}{P_i} + \sum_{j=i}^{i+146} \log\frac{1 - O_j}{1 - O_j + P_j}, \quad i = 1, \ldots, L-146 \tag{1}$$

Here $E_i$ is the Percus energy at bp $i$, $\mu$ is the chemical potential of histone octamers,
$k_B T$ is the product of the Boltzmann constant and room temperature, $L$ is the number
of bps in the DNA segment, $P_i$ is the probability to start a nucleosome at bp $i$, and $O_i$ is
the nucleosome occupancy of bp $i$ ($O_i = \sum_{j=i-146}^{i} P_j$).
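
Eq. 1 can be evaluated directly from a start probability profile. The following Python sketch is a hedged illustration, not the software used in this work: `percus_energies` is a hypothetical name, positions are 0-based, start probabilities are taken to be zero beyond the last allowed start, and every $P_i$ is assumed to be strictly positive with all occupancies below 1 so that the logarithms are finite.

```python
import math


def percus_energies(P, size=147):
    """Return (E_i - mu)/kBT for each start position, following Eq. 1.

    P[i] is the probability to start a nucleosome at bp i (0-based),
    for i = 0 .. L - size.
    """
    n_starts = len(P)
    L = n_starts + size - 1

    # O[i]: sum of P over the `size` start positions that cover bp i
    O = [sum(P[max(0, i - size + 1):min(i, n_starts - 1) + 1]) for i in range(L)]

    def p_at(j):
        # no nucleosome can start within the last size-1 bps
        return P[j] if j < n_starts else 0.0

    # prefix sums of log[(1 - O_j) / (1 - O_j + P_j)]
    prefix = [0.0]
    for j in range(L):
        prefix.append(prefix[-1] + math.log((1.0 - O[j]) / (1.0 - O[j] + p_at(j))))

    return [math.log((1.0 - O[i] + P[i]) / P[i]) + prefix[i + size] - prefix[i]
            for i in range(n_starts)]
```

On a small lattice the result can be checked against exact enumeration: for particles of size 2 on 4 bp with start energies a, b, c, the partition function is (1 + e^-a)(1 + e^-c) + e^-b, and feeding the resulting start probabilities into the function recovers a, b and c.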

We establish the degree of correlation between Percus energies and sequence features
found in nucleosomal and linker DNA by fitting them to four sequence-specific models
(Fig. 1c). The position-independent model of order N is given by:

$$E_i^{\mathrm{PI}} = \sum_{n=1}^{N} \sum_{\{a_1 \ldots a_n\}} \epsilon_{a_1 \ldots a_n} \, n^i_{a_1 \ldots a_n} + \epsilon^0, \tag{2}$$

where $N$ is the maximum word length, $\epsilon^0$ is the total offset, and $n^i_{a_1 \ldots a_n}$ is the number of
times a word of length $n$ with sequence $a_1 \ldots a_n$ was found within the nucleosome that
started at bp $i$. $\epsilon_{a_1 \ldots a_n}$ are word energies which do not depend on the nucleosome position
$i$. The word energies are constrained by $\sum_{a_k} \epsilon_{a_1 \ldots a_n} = 0$, $k = 1 \ldots n$, which leaves $3^n$
independent words of length $n$. We exclude all words that extend into 3 terminal bps on
each end of the 147 bp nucleosomal site from our counts.
The spatially resolved model is defined by:

$$\frac{E_i - \mu}{k_B T} = \sum_{j=i+3}^{i+143} \epsilon^{j-i+1}_{\alpha_j \alpha_{j+1}} + \sum_{j=i+3}^{i+144} \epsilon^{j-i+1}_{\alpha_j} + \epsilon^0, \tag{3}$$

where the mono- and dinucleotide energies are constrained as above at each position within the nucleosomal site. The three-region model and the periodic model are described in Supplementary Methods. We use Gaussian smoothing with $\sigma = 20$ for position-independent and three-region models and $\sigma = 2$ for spatially resolved and periodic models.

Eqs. (2) and (3) define linear models which we fit against Percus energies using the lm function from the R statistical software (http://www.r-project.org) (Fig. 1c). For computational reasons the genome is divided into several segments of equal size and a separate model is trained on each segment (Supplementary Fig. 10). The final energy of each word is the average over all models. We restore the dynamic range of fitted energies by rescaling their variance to the variance of the Percus energies on which they were trained, separately for each chromosome. Finally, we predict nucleosome probabilities and occupancies from fitted energies using a standard recursive algorithm (Fig. 1d; Supplementary
Methods). Our predictions and software are available on the Nucleosome Explorer website, http://nucleosome.rutgers.edu.

Figure 1. Outline of the biophysical approach to nucleosome occupancy predictions: GAL1-10 S. cerevisiae locus. a) Nucleosome starting positions mapped to the GAL1-10 locus in the in vitro reconstitution experiment. b) Nucleosome occupancy based on the nucleosome starting positions shown in (a) and smoothed with a $\sigma = 20$ Gaussian (see Supplementary Methods). c) Percus energy inferred from the occupancy profile shown in (b), and a sequence-specific linear model fit to an $N = 2$ position-independent model. d) Nucleosome occupancy predicted using sequence-specific energies and compared with the experimental occupancy based on the nucleosome starting positions shown in (a) (same as (b) but without Gaussian smoothing). e) Nucleosomes are positioned over G:C-rich sequences: shown are nucleotide counts in the GAL1-10 locus, smoothed with a 100 bp moving average.

Figure 2. Position-independent model predicts in vitro nucleosome occupancy in S. cerevisiae with high accuracy. a) Density scatter plot for the nucleosome occupancy at each genomic base pair predicted with the $N = 5$ position-independent model vs. in vitro occupancy observed by Zhang et al. The color of each region represents the number of data points mapped to that region. Our model is fit to these data (see Methods). b) Same as (a) except the in vitro occupancy is from Kaplan et al. c) The receiver operating characteristic (ROC) curve for discriminating between DNA segments with high and low nucleosome occupancy. The yeast genome was parsed into 500 bp windows and the average nucleosome occupancy was computed for each window. 5000 windows with the highest and 5000 with the lowest average occupancies were ranked high-to-low using occupancies predicted with the $N = 2$ position-independent model, $N = 5$ position-independent model, and Kaplan et al. model. For each partial list of ranked windows with 1, ..., 10000 entries we plot the number of windows in the list known to have high occupancy on the y-axis, low occupancy on the x-axis. d) Rank-order plots of energies of 5 bp words: the energy of each word is ranked using position-independent models of order $N = 1$ through $N = 4$ and compared with the $N = 5$ model. Each curve shows the number of words whose ranks are separated by a given distance or less.

Figure 3. Dinucleotide distributions in nucleosome and linker sequences. a) Upper panel: average relative frequencies of WW (AA, TT, AT and TA) and SS (CC, GG, CG and GC) dinucleotides at each position within the nucleosome are plotted with respect to the nucleosome dyad. The relative frequency of each dinucleotide is defined as its frequency at a given position divided by genome-wide frequency. All frequencies are smoothed using a 3 bp moving average. Solid lines: well-positioned nucleosomes defined by five or more sequence reads, dashed lines: bulk nucleosomes defined by one or two sequence reads. Lower panel: heat map of relative frequencies for each dinucleotide, plotted with respect to the nucleosome dyad. Nucleosomes were assembled in vitro on the yeast genome using salt dialysis. b) Average energies of WW (AA, TT, AT and TA) and SS (CC, GG, CG and GC) dinucleotides at each position within the nucleosome predicted with the $N = 2$ spatially resolved model are plotted with respect to the nucleosome dyad.

Figure 4. Prediction of nucleosome occupancies in C. elegans and E. coli. Density scatter plots for the nucleosome occupancy at each genomic base pair (predicted with the $N = 2$ position-independent model) vs. in vivo occupancy in C. elegans (a) and in vitro occupancy in E. coli (b). Rank-order plots of energies of 2 bp words (c): the energy of each word is ranked using a position-independent model of order $N = 2$ trained on either in vitro (S. cerevisiae, E. coli) or in vivo (C. elegans) nucleosome positioning data. Each curve shows the number of words whose ranks are separated by a given distance or less in the C. elegans and E. coli vs. S. cerevisiae fits.

Figure 5. Nucleosome-free control experiments yield sequences with nucleosome-like dinucleotide distributions. Rank-order plots of energies of 2 bp words: the energy of each word is ranked using a position-independent model of order $N = 2$ trained on either in vitro nucleosome positioning sequences or fragments of mononucleosomal size obtained from sonication and MNase digestion assays of nucleosome-free yeast DNA. Each curve shows the number of words whose ranks are separated by a given distance or less in the sonication and MNase digestion vs. nucleosomal fits.

Tables

Table 1. Position-dependent distributions of sequence motifs do not yield improved nucleosome occupancy predictions. Shown are correlation coefficients between two in vitro nucleosome occupancy profiles and four $N = 2$ models, and between the position-independent model and three models with position-dependent energies.

Supplementary Material

S.l Supplementary Methods 

S.1.1 Preprocessing high-throughput sequencing data 

Mapping sequence read profiles. 

We start from a collection of 25 bp-long Solexa sequence reads uniquely mapped onto 
the yeast genome with no more than two mismatches. Each read is mapped onto either 
the forward (5') or the reverse (3') strand. For sequence reads mapped onto the forward 
(5') strand, we interpret the first base of a read as the start position of a nucleosome with 
the canonical length of 147 bp. For sequence reads mapped onto the reverse (3') strand, 
we interpret the last base of the read as the end position of a 147 bp nucleosome. Thus we 
create a "sequence read profile" , a table which shows the number of nucleosomes starting 
at each genomic bp. This table is used to create a "read coverage profile" which shows 
how many nucleosomes cover each genomic bp. 
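
The mapping step above can be sketched as follows. This is an illustrative sketch rather than the pipeline used for this work: the function names, the 0-based coordinates, and the representation of a read as a `(leftmost_position, strand)` pair are assumptions made here, as is the reading that "the last base" of a reverse-strand read is its rightmost coordinate on the forward strand.

```python
NUC_LEN = 147  # canonical nucleosome length (bp)

def read_profile(reads, chrom_len, read_len=25):
    """Count nucleosome starts per bp from (leftmost_pos, strand) reads."""
    starts = [0] * chrom_len
    for pos, strand in reads:
        if strand == "+":
            start = pos                    # first base of the read = nucleosome start
        else:
            end = pos + read_len - 1       # rightmost base of the read = nucleosome end
            start = end - NUC_LEN + 1
        if 0 <= start <= chrom_len - NUC_LEN:
            starts[start] += 1             # nucleosomes falling off the ends are dropped
    return starts

def coverage_profile(starts):
    """Number of nucleosomes covering each bp (sliding sum over 147 start positions)."""
    cover, running = [], 0
    for i, s in enumerate(starts):
        running += s
        if i >= NUC_LEN:
            running -= starts[i - NUC_LEN]
        cover.append(running)
    return cover
```

Each mapped read contributes exactly 147 units of coverage, so the sum of the coverage profile equals 147 times the number of retained reads.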
Filtering sequence read profiles. 

We observe that there are large gaps in our read profiles, possibly due to repetitive regions in the genome to which reads cannot be mapped uniquely, or to sequencing artifacts. We consider any stretch of more than 1000 bp without mapped reads to be anomalous and exclude such regions from further analysis. We also find regions where the read coverage is uncharacteristically high. For instance, our in vitro nucleosome measurement for chromosome 12 has an average nucleosome coverage of ~80 reads, but there is a small region near bp 460000 covered with 5000 reads. We exclude such regions according to the following procedure: for each chromosome, we find the average number of reads per bp. Next, for each bp we calculate the running average number of reads in a window extending 75 bp in each direction. If this running average is more than three times the mean, we flag the region which extends out from the identified point in both directions until the running average equals the mean, and we remove this region from consideration. We then create a filter which marks the union of all excluded regions. Finally, each excluded region is extended 146 bp upstream so that there is no contribution to the nucleosome energy from filtered regions.
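
A possible implementation of this filter is sketched below. It is only an illustration under stated assumptions: the function name `excluded_mask` and its parameters are hypothetical, gaps are detected on the per-bp read-start counts, the running average is zero-padded at chromosome ends, and a flagged region is grown until the running average drops to the chromosome mean or below.

```python
def excluded_mask(starts, gap=1000, half_window=75, factor=3.0, nuc_len=147):
    """Return a list of booleans marking bps excluded from the analysis."""
    n = len(starts)
    mask = [False] * n
    # 1) Stretches of more than `gap` bp without any reads.
    last = -1
    for i in range(n + 1):
        if i == n or starts[i] > 0:
            if i - last - 1 > gap:
                for j in range(last + 1, i):
                    mask[j] = True
            last = i
    # 2) Regions of abnormally high coverage, found with a running average.
    prefix = [0.0]
    for s in starts:
        prefix.append(prefix[-1] + s)
    width = 2 * half_window + 1

    def running(i):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        return (prefix[hi] - prefix[lo]) / width

    mean = prefix[-1] / n
    spike = [False] * n
    for peak in range(n):
        if spike[peak] or running(peak) <= factor * mean:
            continue
        lo = peak
        while lo > 0 and running(lo) > mean:
            lo -= 1
        hi = peak
        while hi < n - 1 and running(hi) > mean:
            hi += 1
        for j in range(lo, hi + 1):
            spike[j] = True
    mask = [m or s for m, s in zip(mask, spike)]
    # 3) Extend every excluded region 146 bp upstream.
    extended = list(mask)
    for i in range(n):
        if mask[i] and (i == 0 or not mask[i - 1]):
            for j in range(max(0, i - (nuc_len - 1)), i):
                extended[j] = True
    return extended
```

The upstream extension in step 3 ensures that no 147 bp site starting outside an excluded region overlaps it.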
Normalizing sequence read profiles. 

Next we use the sequence read profile to create nucleosome probability and occupancy profiles. First, we set sequence read counts to zero inside all filtered regions. Second, we use a Gaussian smoothing algorithm that replaces the number of sequence reads at a given bp with a normal distribution centered at that bp. The Gaussian is chosen to have $\sigma = 2$ or 20 depending on subsequent modeling, and the area under the curve is equal to the number of sequence reads at that bp. The smoothed sequence read profile is then constructed as a superposition of all such Gaussians.

The smoothing procedure reflects a lack of bp precision in MNase digestion assays, which results in uncertainty in the interpretation of sequence read coordinates as nucleosome start or end positions. In addition, because neighboring nucleosomes are expected to have similar binding affinities, collecting more sequence read data is assumed to result in a read profile that we approximate with the superposition of normal distributions centered on available reads.

We extend the smoothed read profile into a smoothed read coverage profile as described above, find the highest point $N_{max}$ in the smoothed coverage profile and multiply the height of each point in the smoothed coverage profile and the smoothed read profile by $1/N_{max}$ so that the maximum coverage is one. Each point in the smoothed sequence read profile may now be interpreted as the probability for a nucleosome to start at a given position, and the coverage may be interpreted as the probability for any nucleosome to occupy a given position. We refer to the scaled results as nucleosome probability and occupancy profiles, respectively.
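
The smoothing and normalization steps can be sketched as follows. The function names are hypothetical, and the truncation of each Gaussian at $4\sigma$ (with the discrete kernel renormalized to unit area) is an assumption introduced here for the sake of a finite kernel; it is not specified in the text.

```python
import math

def gaussian_smooth(starts, sigma):
    """Replace each read count with a Gaussian of equal area centred on that bp."""
    half = int(math.ceil(4 * sigma))                 # assumed truncation at 4 sigma
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-half, half + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]             # unit area
    n = len(starts)
    smooth = [0.0] * n
    for i, count in enumerate(starts):
        if count == 0:
            continue
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                smooth[j] += count * w
    return smooth

def probability_and_occupancy(starts, sigma, nuc_len=147):
    """Return (probability profile, occupancy profile), scaled so max occupancy is 1."""
    smooth = gaussian_smooth(starts, sigma)
    cover, running = [], 0.0
    for i, s in enumerate(smooth):
        running += s
        if i >= nuc_len:
            running -= smooth[i - nuc_len]
        cover.append(running)
    n_max = max(cover)
    return [s / n_max for s in smooth], [c / n_max for c in cover]
```

Because both profiles are divided by the same constant, the occupancy remains the 147 bp sliding sum of the probability profile after scaling.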

S.1.2 Energetics of DNA-binding one-dimensional particles of 
finite size 

Consider particles of size a bp distributed along a DNA segment of length L bp. The 
particles can interact with DNA in a position-dependent manner and are also subject to 
steric exclusion (adjacent particles cannot overlap). A grand-canonical partition function 
for this system of DNA-bound particles is given by: 

$$Z = \sum_{\mathrm{conf}} e^{-[E(\mathrm{conf}) - \mu N(\mathrm{conf})]}, \tag{S.1}$$

where conf denotes an arbitrary configuration of DNA-bound non-overlapping particles, $\mu$ is the chemical potential, and $E(\mathrm{conf})$ and $N(\mathrm{conf})$ are the total DNA-binding energy and the number of particles in the current configuration (for simplicity we assume $k_B T = 1$, where $k_B$ is the Boltzmann constant and $T$ is room temperature).
One can compute $Z$ efficiently using a recursive relation:

$$Z^f_i = Z^f_{i-1} + Z^f_{i-a}\, e^{-(E_{i-a+1} - \mu)}, \qquad i = a, \dots, L \tag{S.2}$$

$$Z^f_{a-1} = \dots = Z^f_0 = 1$$

which computes a set of partial partition functions in the forward direction. Likewise, partial partition functions can be computed in the reverse direction:

$$Z^r_i = Z^r_{i+1} + Z^r_{i+a}\, e^{-(E_i - \mu)}, \qquad i = L-a+1, \dots, 1 \tag{S.3}$$

$$Z^r_{L-a+2} = \dots = Z^r_{L+1} = 1$$

Note that $Z^f_L = Z^r_1 = Z$ by construction. Furthermore, the probability of starting a particle at position $i$ is given by:

$$P_i = \frac{Z^f_{i-1}\, e^{-(E_i - \mu)}\, Z^r_{i+a}}{Z}. \tag{S.4}$$

Intuitively, Eq. (S.4) is a partition function for all configurations in which a particle is bound at position $i$ (occupying positions $i$ through $i + a - 1$), divided by the partition function for all possible configurations. Using Eqs. (S.2), (S.3) and (S.4) we obtain:
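
The forward and reverse recursions and the resulting start probabilities can be sketched in a few lines. This is a didactic sketch, not the production algorithm: the function name is hypothetical, energies are supplied as $(E_i - \mu)$ in units of $k_B T$ for start positions $1, \dots, L-a+1$, and the partition functions are accumulated in linear space, which would overflow on genome-scale inputs (a practical implementation would work with logarithms).

```python
import math

def start_probabilities(energies, a=147):
    """Return (P, O, Z) for particles of size `a` on a segment of length L.

    energies[k] is (E - mu)/kBT for a particle starting at 1-based position k + 1;
    len(energies) must equal L - a + 1.
    """
    w = [math.exp(-e) for e in energies]       # Boltzmann weights
    L = len(w) + a - 1
    Zf = [1.0] * (L + 1)                       # forward: Zf[0..a-1] = 1
    for i in range(a, L + 1):
        Zf[i] = Zf[i - 1] + Zf[i - a] * w[i - a]
    Zr = [1.0] * (L + 2)                       # reverse: Zr[L-a+2..L+1] = 1
    for i in range(L - a + 1, 0, -1):
        Zr[i] = Zr[i + 1] + Zr[i + a] * w[i - 1]
    Z = Zf[L]
    P = [0.0] * L
    for i in range(1, L - a + 2):              # probability of a start at position i
        P[i - 1] = Zf[i - 1] * w[i - 1] * Zr[i + a] / Z
    O, running = [], 0.0                       # occupancy: sliding sum of P
    for i, p in enumerate(P):
        running += p
        if i >= a:
            running -= P[i - a]
        O.append(running)
    return P, O, Z
```

For example, with $a = 2$, $L = 4$ and all energies equal to zero there are five allowed configurations (empty, one particle at position 1, 2 or 3, and two particles at positions 1 and 3), so $Z = 5$ and the start probabilities are 2/5, 1/5, 2/5 and 0.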

$$Z^f_i - Z^f_{i-1} = P_{i-a+1}\, Z / Z^r_{i+1}, \qquad i = a, \dots, L$$

$$Z^r_{i+1} - Z^r_i = -P_i\, Z / Z^f_{i-1}, \qquad i = L-a+1, \dots, 1 \tag{S.5}$$

Note that both of these formulas can be extended to the $i = 1, \dots, L$ range if we assume that $P_k = 0$, $k \notin [1, L-a+1]$. It is easy to show that $Z^f_i Z^r_{i+1} - Z^f_{i-1} Z^r_i = Z\,(P_{i-a+1} - P_i)$. This expression has the form of a complete differential and thus can be iterated as follows:

$$Z^f_L Z^r_{L+1} - Z^f_{i-1} Z^r_i = Z \sum_{j=i}^{L} (P_{j-a+1} - P_j), \tag{S.6}$$

yielding

$$Z^f_{i-1} Z^r_i = Z \left( 1 - \sum_{j=i-a+1}^{i-1} P_j \right), \qquad i = 1, \dots, L \tag{S.7}$$

Using Eqs. (S.3), (S.4) and (S.7) we get:

$$Z^r_{i+1} = Z^r_i \left[ 1 - \frac{P_i}{1 - \sum_{j=i-a+1}^{i-1} P_j} \right]. \tag{S.8}$$

Introducing $O_i = \sum_{j=i-a+1}^{i} P_j$, the probability that position $i$ is covered by any particle regardless of its starting position (also called the particle occupancy), we see that:

$$Z^r_{i+1} = Z^r_i\, \frac{1 - O_i}{1 - O_i + P_i}. \tag{S.9}$$

Using Eq. (S.9) recursively (until $Z^r_{L+1} = 1$ is reached on the left-hand side), we obtain an explicit expression for $Z^r_i$:

$$Z^r_i = \prod_{j=i}^{L} \frac{1 - O_j + P_j}{1 - O_j}. \tag{S.10}$$

Likewise, using Eqs. (S.2), (S.4) and (S.7) together with $Z^f_0 = 1$ we get:

$$Z^f_i = \prod_{j=1}^{i} \frac{1 - O_j + P_{j-a+1}}{1 - O_j}. \tag{S.11}$$

Eqs. (S.10) and (S.11) are explicit expressions for forward and reverse partial partition functions in terms of particle probabilities and occupancies. Note that $Z^f_L = Z^r_1 = Z$ still holds, with Eqs. (S.10) and (S.11) providing alternative expressions for the partition function in this limit. Inserting Eqs. (S.10) and (S.11) into Eq. (S.4) and using Eq. (S.7) to express $Z^f_{i-1}$ in terms of $Z^r_i$ leads to the desired expression for the DNA-binding energy of the particle at position $i$:

$$E_i - \mu = \log\frac{1 - O_i + P_i}{P_i} + \sum_{j=i}^{i+a-1} \log\frac{1 - O_j}{1 - O_j + P_j}, \qquad i = 1, \dots, L-a+1 \tag{S.12}$$

Alternatively, we can use Eq. (S.7) to express $Z^r_{i+a}$ in terms of $Z^f_{i+a-1}$, leading to an equivalent expression for the DNA-binding energy:

$$E_i - \mu = \log\frac{1 - O_{i+a-1} + P_i}{P_i} + \sum_{j=i}^{i+a-1} \log\frac{1 - O_j}{1 - O_j + P_{j-a+1}}, \qquad i = 1, \dots, L-a+1 \tag{S.13}$$



S.1.3 Hierarchical models of nucleosome energetics 

We have created hierarchical models of nucleosome energetics which assign non-zero energies to nucleotide motifs of length $N$ only if the nucleosome energies cannot be explained using nucleotide motifs of lengths $1 \dots N-1$. This is implemented using constraints on word energies:

$$\sum_{\alpha_i} \epsilon_{\alpha_1 \dots \alpha_N} = 0, \qquad \forall i = 1 \dots N \tag{S.14}$$

Here $\epsilon_{\alpha_1 \dots \alpha_N}$ is the energy of the word of length $N$ with nucleotides $\alpha_1 \dots \alpha_N$ at positions $1 \dots N$.

With these constraints and the {A, C, G, T} alphabet there are $3^N$ independent parameters describing energetics of words of length $N$. For example, for $N = 1$ we can choose $\{\epsilon_A, \epsilon_G, \epsilon_T\}$ to be independent, while $\epsilon_C$ is fixed by the constraint: $\epsilon_C = -(\epsilon_A + \epsilon_G + \epsilon_T)$. For $N = 2$ there are 9 independent parameters: $\{\epsilon_{AA}, \epsilon_{AG}, \epsilon_{AT}, \epsilon_{GA}, \epsilon_{GG}, \epsilon_{GT}, \epsilon_{TA}, \epsilon_{TG}, \epsilon_{TT}\}$, while the other 7 dinucleotide energies can be expressed through these using Eq. (S.14). The remaining 7 degrees of freedom are described by the lower order terms: 6 $\epsilon_\alpha$'s (3 for each position in the dinucleotide) and the total offset $\epsilon^0$.
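
The parameter counting above can be checked numerically. The short sketch below (the function name is introduced here for illustration) tabulates the number of constrained energies of each order for a word of length $N$, counting sub-words at every choice of positions within the longer word, and confirms that the orders add up to the $D^N$ unconstrained energies.

```python
from math import comb

def constrained_counts(N, D=4):
    """Number of constrained energies of order k = N..0 for words of length N.

    A sub-word of order k can sit at comb(N, k) sets of positions and has
    (D - 1)**k independent energies once the sum-to-zero constraints are imposed.
    """
    return {k: comb(N, k) * (D - 1) ** k for k in range(N, -1, -1)}

counts = constrained_counts(2)
print(counts)                # {2: 9, 1: 6, 0: 1}
print(sum(counts.values()))  # 16
```

For dinucleotides this reproduces the 9 + 6 + 1 = 16 split described in the text.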

In general, $D^N$ degrees of freedom associated with words of length $N$ drawn from an alphabet of size $D$ can be described using constrained energies:

$$D^N = (D-1)^N + \binom{N}{N-1} (D-1)^{N-1} + \dots + \binom{N}{0} (D-1)^0, \tag{S.15}$$

where each term describes the total number of constrained energies of order $(N, \dots, 0)$, computed as a product of the number of constrained energies at each possible position within the longer word, and the number of such positions. Note that the zeroth order term is simply the total offset $\epsilon^0$. Furthermore, shorter words comprised of non-consecutive nucleotides are included in the expansion. If we set the energies of all non-consecutive words to zero, the total energy of a word of length $N$ can be written as:

$$E_{\alpha_1 \dots \alpha_N} = \sum_{n=1}^{N} \sum_{j=1}^{N-n+1} \epsilon^{j}_{\alpha_j \dots \alpha_{j+n-1}} + \epsilon^0 \tag{S.16}$$

Note that here and in Section S.1.4 below we set $\mu = 0$ for simplicity. Although a set of constrained energies of order $0, \dots, N$ on the right-hand side of Eq. (S.16) has fewer degrees of freedom than a set of unconstrained energies of order $N$, it provides the most complete description involving consecutive nucleotide motifs, and forms a basis of nucleosome models that have been further simplified by equating energies of motifs that occur at different positions within the nucleosomal site. Furthermore, since dinucleotides are too short to contain partial non-consecutive motifs, Eq. (S.16) entails no loss of degrees of freedom for $N = 2$.

S.1.4 Sequence-specific models of nucleosome energetics 

Eq. (S.12) can be used to convert nucleosome probabilities and occupancies obtained from 
high-throughput sequencing data into histone-DNA interaction energies for each position 
i along the DNA, under the assumption that steric exclusion and specific interactions with 
DNA are the only factors that affect nucleosome positions in vitro. In order to understand 
which DNA sequence features explain the observed energy profile, we carried out linear 
fits of genome-wide Percus energies (Eq. (S.12)) to four sequence-specific models. Some 
models were designed to focus on the ~10-11 bp periodic distributions of sequence
motifs, while others capture nucleosome-wide sequence signals such as motif enrichment 
and depletion in nucleosome-covered sequences. 
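
To make the fitting step concrete, the sketch below fits the simplest case, an $N = 1$ position-independent model, by ordinary least squares and then rescales the variance of fitted energies. It is an illustration only: it uses a small pure-Python normal-equations solver in place of the R lm function, all function names are hypothetical, and the sum-to-zero constraint is imposed by eliminating $\epsilon_C = -(\epsilon_A + \epsilon_G + \epsilon_T)$.

```python
import math

def core_counts(seq, start, site_len=147, edge=3):
    """Mononucleotide counts in the site starting at `start`, excluding 3 bp per end."""
    core = seq[start + edge:start + site_len - edge]
    return {c: core.count(c) for c in "ACGT"}

def design_row(seq, start):
    n = core_counts(seq, start)
    # Columns: offset, then n_A - n_C, n_G - n_C, n_T - n_C (sum-to-zero coding).
    return [1.0, n["A"] - n["C"], n["G"] - n["C"], n["T"] - n["C"]]

def solve(A, y):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_mono_model(seq, energies):
    """Least-squares fit of energies[i] (site starting at bp i) to the N = 1 model."""
    X = [design_row(seq, i) for i in range(len(energies))]
    k = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * e for r, e in zip(X, energies)) for a in range(k)]
    coef = solve(XtX, Xty)
    eps = {"A": coef[1], "G": coef[2], "T": coef[3],
           "C": -(coef[1] + coef[2] + coef[3])}
    return coef[0], eps

def rescale_variance(fitted, target):
    """Rescale fitted energies about their mean to match the variance of `target`."""
    def mean(v):
        return sum(v) / len(v)
    def std(v):
        m = mean(v)
        return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    m, s = mean(fitted), std(target) / std(fitted)
    return [m + (x - m) * s for x in fitted]
```

Higher-order models follow the same pattern with more columns in the design matrix; only the construction of the constrained word counts changes.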
Spatially resolved model. 

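In terms of unconstrained energies, the spatially resolved model is defined as:

$$E(S) = \sum_{i=l_1}^{l_2-1} e^{i}_{\alpha_i \alpha_{i+1}}, \tag{S.17}$$

where $E(S)$ is the nucleosome formation energy of a 147 bp-long sequence $S$, $e^{i}_{\alpha_i \alpha_{i+1}}$ is the energy of the dinucleotide with bases $\alpha_i$ and $\alpha_{i+1}$ at positions $i$ and $i+1$ respectively, and the sum runs from $l_1 \geq 1$ to $l_2 \leq 147$ in the nucleosomal site. To minimize edge effects, we typically exclude 3 bps from each end of the nucleosome, setting $l_1 = 4$ and $l_2 = 144$. Eq. (S.17) can be rewritten as:

$$E(S) = \sum_{i=l_1}^{l_2-1} \left( \epsilon^{i}_{\alpha_i \alpha_{i+1}} + b^{i}_{\alpha_i} + \bar{b}^{i}_{\alpha_{i+1}} \right) + \epsilon^0, \tag{S.18}$$

where

$$\epsilon^0 = \frac{1}{D^2} \sum_{i=l_1}^{l_2-1} \sum_{\alpha,\beta=1}^{D} e^{i}_{\alpha\beta},$$

$$b^{i}_{\alpha} = \frac{1}{D} \sum_{\beta=1}^{D} e^{i}_{\alpha\beta} - \frac{1}{D^2} \sum_{\alpha',\beta'=1}^{D} e^{i}_{\alpha'\beta'},$$

$$\bar{b}^{i}_{\beta} = \frac{1}{D} \sum_{\alpha=1}^{D} e^{i}_{\alpha\beta} - \frac{1}{D^2} \sum_{\alpha',\beta'=1}^{D} e^{i}_{\alpha'\beta'},$$

and $\epsilon^{i}_{\alpha\beta}$ is the remainder of $e^{i}_{\alpha\beta}$ after these three contributions are subtracted. Note that $\sum_{\alpha=1}^{D} \epsilon^{i}_{\alpha\beta} = \sum_{\beta=1}^{D} \epsilon^{i}_{\alpha\beta} = 0$ by construction. Eq. (S.18) is equivalent to the expansion in terms of constrained energies which is consistent with Eq. (S.16):

$$E(S) = \sum_{i=l_1}^{l_2-1} \epsilon^{i}_{\alpha_i \alpha_{i+1}} + \sum_{i=l_1}^{l_2} \epsilon^{i}_{\alpha_i} + \epsilon^0, \tag{S.19}$$

where $\epsilon^{l_1}_{\alpha} = b^{l_1}_{\alpha}$, $\epsilon^{l_1+1}_{\alpha} = b^{l_1+1}_{\alpha} + \bar{b}^{l_1}_{\alpha}$, ..., $\epsilon^{l_2}_{\alpha} = \bar{b}^{l_2-1}_{\alpha}$. Thus an unconstrained description of nucleosome energetics can be uniquely decomposed into a constrained description. However, the opposite is not true: for any $p$ and $q$ such that $p + q = 1$,

$$e^{l_1}_{\alpha_{l_1} \alpha_{l_1+1}} = \epsilon^{l_1}_{\alpha_{l_1} \alpha_{l_1+1}} + \epsilon^{l_1}_{\alpha_{l_1}} + q\, \epsilon^{l_1+1}_{\alpha_{l_1+1}},$$

$$e^{i}_{\alpha_i \alpha_{i+1}} = \epsilon^{i}_{\alpha_i \alpha_{i+1}} + p\, \epsilon^{i}_{\alpha_i} + q\, \epsilon^{i+1}_{\alpha_{i+1}}, \qquad l_1 < i < l_2 - 1,$$

$$e^{l_2-1}_{\alpha_{l_2-1} \alpha_{l_2}} = \epsilon^{l_2-1}_{\alpha_{l_2-1} \alpha_{l_2}} + p\, \epsilon^{l_2-1}_{\alpha_{l_2-1}} + \epsilon^{l_2}_{\alpha_{l_2}}$$

are equally valid reconstructions that leave $E(S)$ unchanged (apart from the constant offset $\epsilon^0$). In this paper we use $p = 1$, $q = 0$ to compute unconstrained dinucleotide energies from constrained ones.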
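
The decomposition of unconstrained dinucleotide energies at a single position into a constrained dinucleotide part, two mononucleotide parts and an offset can be illustrated with the sketch below. The function name and the nested-list representation of the $D \times D$ energy table are assumptions made here for illustration.

```python
def decompose(e):
    """Split a D x D table e[a][c] of unconstrained energies at one position.

    Returns (eps, b, b_bar, mean): eps has zero row and column sums, b depends on
    the first base only, b_bar on the second base only, and mean is the offset.
    """
    D = len(e)
    mean = sum(sum(row) for row in e) / D ** 2
    b = [sum(e[a]) / D - mean for a in range(D)]
    b_bar = [sum(e[a][c] for a in range(D)) / D - mean for c in range(D)]
    eps = [[e[a][c] - b[a] - b_bar[c] - mean for c in range(D)] for a in range(D)]
    return eps, b, b_bar, mean
```

Summing the four parts recovers the original table exactly, while the reverse step (assigning mononucleotide energies back to dinucleotides) is not unique, which is why a convention such as $p = 1$, $q = 0$ is needed.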
Position-independent model. 

This model assigns the same energy to a given word within the nucleosome, regardless of its position in the site. Thus the position-independent model of order $N$ is given by:

$$E(S) = \sum_{n=1}^{N} \sum_{\{\alpha_1 \dots \alpha_n\}}^{3^n} n_{\alpha_1 \dots \alpha_n}\, \epsilon_{\alpha_1 \dots \alpha_n} + \epsilon^0,$$

where the outer sum is over word lengths, the inner sum is over all words of length $n$ corresponding to constrained energies, $n_{\alpha_1 \dots \alpha_n}$ is the number of words with the nucleotides $\alpha_1 \dots \alpha_n$ at positions $1 \dots n$, and $\epsilon_{\alpha_1 \dots \alpha_n}$ are word energies constrained by Eq. (S.14). As in the spatially resolved model, the words are counted from bp $l_1 = 4$ to bp $l_2 = 144$, excluding 3 bp from each end of the site. The words are not allowed to extend outside this region. Note that both in this model and in the two partially position-dependent models described below there is no one-to-one correspondence between constrained models utilizing words of order $1 \dots N$ and their unconstrained counterparts utilizing words of order $N$: the former require fewer fitting parameters.

Three-region model.

This model refines the position-independent model by dividing the 141 bp nucleosome site into 3 regions of equal length. Word energies are fitted separately inside each region. The total energy of sequence $S$ is then given by:

$$E(S) = \sum_{r=1}^{3} \sum_{n=1}^{N} \sum_{\{\alpha_1 \dots \alpha_n\}}^{3^n} n^{r}_{\alpha_1 \dots \alpha_n}\, \epsilon^{r}_{\alpha_1 \dots \alpha_n} + \epsilon^0,$$

where $r$ refers to a particular 47 bp region.
Periodic model. 

This model enforces DNA helical twist periodicity by equating the energies of words 
separated by a multiple of 10 bp. To reduce the number of fitting parameters, we also 
grouped energies of words at positions 1 ... 10 into 5 distinct bins. Thus an AGT motif 
starting at position 1 within the nucleosome site would have the same energy as the 
AGT motif starting at positions 11, 21, 31 . . . as well as positions 2, 12, 22 ... , whereas 
the energy of the same motif starting at positions 3 and 4 is grouped into a different bin. 
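The total energy is then computed as:

$$E(S) = \sum_{b=1}^{5} \sum_{n=1}^{N} \sum_{\{\alpha_1 \dots \alpha_n\}}^{3^n} n^{b}_{\alpha_1 \dots \alpha_n}\, \epsilon^{b}_{\alpha_1 \dots \alpha_n} + \epsilon^0,$$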

where $b$ is the bin index used to group motifs separated by the helical twist as described above. As before, all words overlapping with the 3 bp edge regions are excluded from the counts.

Supplementary Figures

Supplementary Figure 1. $N = 2$ position-independent model is sufficient to explain nucleosome occupancy in S. cerevisiae. a) Density scatter plot for the nucleosome occupancy at each genomic base pair (predicted with the $N = 2$ position-independent model) vs. in vitro occupancy observed by Zhang et al. b) Same as (a) except that in vitro occupancy is from Kaplan et al.

Supplementary Figure 2. Similar predictive power of the $N = 2$ position-independent model and a bioinformatics model based on periodic dinucleotide distributions and frequencies of 5 bp-long words. Density scatter plot for the nucleosome occupancy at each genomic base pair (predicted with the $N = 2$ position-independent model) vs. nucleosome occupancy predicted by Kaplan et al.

Supplementary Figure 3. Minor role of the higher-order contributions to the energies of 5 bp-long words. $N = 5$ position-independent model was trained on nucleosomes reconstituted in vitro on the yeast genome, yielding energies of all motifs of 1 through 5 bp in length. Energies of 5 bp-long words were then computed by summing contributions from a subset of shorter motifs: $E(S) = \sum_{n=L}^{5} \sum_{\{\alpha_1 \dots \alpha_n\}} n_{\alpha_1 \dots \alpha_n} \epsilon_{\alpha_1 \dots \alpha_n}$, where $n_{\alpha_1 \dots \alpha_n}$ is the number of times a given word was found in the 5 bp-long sequence $S$ and $\epsilon_{\alpha_1 \dots \alpha_n}$ is the fitted energy of that word. $L = 5 \dots 1$ is the length of the shortest motif included into $E(S)$. Grey: all 5 bp-long words, black: A:T-containing words, green: the poly(dA:dT) tract (AAAAA).

Supplementary Figure 4. Dinucleotide distributions in nucleosome and linker sequences. Upper panel: average relative frequencies of WW (AA, TT, AT and TA) and SS (CC, GG, CG and GC) dinucleotides at each position within the nucleosome are plotted with respect to the nucleosome dyad. The relative frequency of each dinucleotide is defined as its frequency at a given position divided by genome-wide frequency. All frequencies are smoothed using a 3 bp moving average. Lower panel: heat map of relative frequencies for each dinucleotide, plotted with respect to the nucleosome dyad. a) Nucleosomes assembled in vitro on the yeast genome (defined by more than five sequence reads), from Kaplan et al. b) In vivo nucleosomes (defined by more than five sequence reads) from yeast cells grown in YPD medium. Upper panel: dashed lines - cross-linked nucleosomes, solid lines - no cross-linking. Lower panel: dinucleotide counts based on a combination of all YPD replicates. c) Nucleosomes assembled in vitro on the E. coli genome (defined by more than one sequence read). d) In vivo nucleosomes (defined by more than three sequence reads) from C. elegans. e) Same as (a)-(d) except the dinucleotide frequencies are from mononucleosome-size DNA sequences (defined by more than five sequence reads) from yeast genomic DNA digested by MNase in the absence of nucleosomes. f) Same as (e) except mononucleosome-size DNA sequences (defined by more than one sequence read) were obtained by sonication.

Supplementary Figure 5. Prediction of six nucleosome positions mapped in vitro at high resolution. Shown are nucleosome formation energies computed using the $N = 2$ position-independent model (green curves) and the spatially resolved model (blue curves). Vertical lines: known nucleosome starting positions, also listed in parentheses below. (a) The 180 bp sequence from the sea urchin 5S rRNA gene (bps 8, 26). (b) The 183 bp sequence from the pGUB plasmid (bps 11, 31). (c) The 215 bp fragment from the sequence of the chicken $\beta$-globin gene (bp 52).$^{38}$ (d,e,f) Synthetic high-affinity sequences 601 (bp 61), 603 (bp 81) and 605 (bp 59).

Supplementary Figure 6. Nucleosome energies and occupancies in the vicinity of transcription start and termination sites. a) Percus energy (red) and the sequence-specific energy predicted using the $N = 2$ position-independent model (blue). The energies were inferred from nucleosomes positioned in vitro on the yeast genome, averaged over all genes for which transcript coordinates were available, and plotted with respect to the transcription start and termination sites (TSS and TTS, respectively). All energies were divided by a genome-wide average. b) In vitro nucleosome occupancy (red), in vivo nucleosome occupancy in YPD medium without cross-linking (blue), and occupancy predicted using the $N = 2$ position-independent model (black). All occupancies were divided by the genome-wide average and plotted as described in (a).

Supplementary Figure 7. Histogram of distances between neighboring peaks from in vitro and in vivo nucleosome sequence read profiles in S. cerevisiae. Mapped sequence reads were smoothed with a $\sigma = 20$ Gaussian. Neighboring peaks are defined by local maxima in the sequence read profile.

Supplementary Figure 8. Comparison of $N = 2$ position-independent models trained on in vitro and in vivo S. cerevisiae nucleosomes. Rank-order plots of energies of 2 bp words: the energy of each word is ranked using a position-independent model of order $N = 2$ trained on either in vivo (with and without cross-linking) or in vitro nucleosome positioning data. Each curve shows the number of words whose ranks are separated in the in vivo vs. in vitro fits by a given distance or less.

Supplementary Figure 9. Autocorrelation functions of nucleosome starting positions. Nucleosomes were assembled in vitro on the yeast genome. Black: original starting positions, violet: starting positions smoothed with a $\sigma = 2$ Gaussian, red: starting positions smoothed with a $\sigma = 20$ Gaussian (see Supplementary Methods).

Supplementary Figure 10. Cross-validation of the $N = 5$ position-independent and $N = 2$ spatially resolved models in S. cerevisiae. a) Rank-order plots of energies of 5 bp words: yeast genome is divided into 4 segments of equal size and the energy of each word is ranked using $N = 5$ position-independent models independently trained on each segment. Each curve shows the number of words whose ranks are separated by a given distance or less. Energies of 5 bp-long words contain contributions from all shorter motifs: $E(S) = \sum_{n=1}^{5} \sum_{\{\alpha_1 \dots \alpha_n\}} n_{\alpha_1 \dots \alpha_n} \epsilon_{\alpha_1 \dots \alpha_n}$, where $n_{\alpha_1 \dots \alpha_n}$ is the number of times a given word was found within the 5 bp-long sequence $S$ and $\epsilon_{\alpha_1 \dots \alpha_n}$ is the fitted energy of that word. b) Rank-order plots of dinucleotide energies at each position predicted with $N = 2$ spatially resolved models independently trained on 47 segments of equal size. Dinucleotide energies at each position are computed using $E_{\alpha_i \alpha_{i+1}} = \epsilon_{\alpha_i \alpha_{i+1}} + \epsilon_{\alpha_i}$, $i = 4 \dots 142$, $E_{\alpha_{143} \alpha_{144}} = \epsilon_{\alpha_{143} \alpha_{144}} + \epsilon_{\alpha_{143}} + \epsilon_{\alpha_{144}}$ (Supplementary Methods) and ranked across all positions. The inset shows a histogram of rank-order correlation coefficients between dinucleotide energies trained on one of the segments, and all other segments.

Supplementary Tables

Supplementary Table 1. Table of correlation coefficients between predicted or observed occupancy profiles on the yeast genome. All observed profiles have been filtered for abnormally high- and low-density regions as described in the Supplementary Methods, with each correlation coefficient computed only for those base pairs that have not been removed from either dataset (predicted occupancies do not have filtered regions). The table is available at http://nucleosome.rutgers.edu.



Supplementary Table 2. Table of dinucleotide energies predicted by training N = 2 
position-independent models on several nucleosome positioning maps and nucleosome-free 
control experiments. Energies for each model have been rescaled to the variance of 1 a.u. 



Species/Control   S. cerevisiae    E. coli          C. elegans       MNase            Sonicated
rank              word   energy    word   energy    word   energy    word   energy    word   energy
 1                TT      1.76     TT      2.14     TT      1.57     TT      1.40     AT      1.30
 2                AA      1.76     AA      2.14     AA      1.57     AA      1.40     TA      1.30
 3                TA      1.10     TA      0.59     CG      1.36     AT      0.90     TT      1.07
 4                AT      0.98     CT      0.26     GC      0.93     GA      0.62     AA      1.07
 5                CT      0.27     AG      0.26     TA      0.71     TC      0.62     CT      0.31
 6                AG      0.27     AT      0.25     CC      0.40     AG      0.32     AG      0.31
 7                TC      0.19     GG      0.09     GG      0.40     CT      0.32     GA      0.16
 8                GA      0.19     CC      0.09     AT      0.06     TA      0.32     TC      0.16
 9                AC     -0.50     GC     -0.15     AG     -0.69     TG      0.04     TG      0.05
10                GT     -0.50     GA     -0.36     CT     -0.69     CA      0.04     CA      0.05
11                CA     -0.55     TC     -0.36     GT     -0.73     GT     -0.24     AC     -0.09
12                TG     -0.55     TG     -0.84     AC     -0.73     AC     -0.24     GT     -0.09
13                GG     -0.81     CA     -0.84     GA     -0.80     CC     -0.77     GG     -0.86
14                CC     -0.81     CG     -1.04     TC     -0.80     GG     -0.77     CC     -0.86
15                GC     -1.40     AC     -1.12     TG     -1.28     CG     -1.97     GC     -1.79
16                CG     -1.42     GT     -1.12     CA     -1.28     GC     -1.99     CG     -2.09